Abstract

Conformal predictors are set predictors that are automatically valid 
in the sense of having coverage probability equal to or exceeding a given 
confidence level. Inductive conformal predictors are a computationally 
efficient version of conformal predictors satisfying the same property of 
validity. However, inductive conformal predictors have been only known 
to control unconditional coverage probability. This paper explores various 
versions of conditional validity and various ways to achieve them using 
inductive conformal predictors and their modifications. 

1 Introduction 

This paper continues the study of the method of conformal prediction, introduced
in Vovk et al. [...] et al. (2005). An advantage of the method is that its
predictions (which are
set rather than point predictions) automatically satisfy a finite-sample property
of validity. Its disadvantage is its relative computational inefficiency in many
situations. A modification of conformal predictors, called inductive conformal
predictors, was proposed in Papadopoulos et al. (2002b,a) with the purpose of
improving on the computational efficiency of conformal predictors.

Most of the literature on conformal prediction studies the behavior of set
predictors in the online mode of prediction, perhaps because the property of
validity can be stated in an especially strong form in the online mode (as
first shown in Vovk 2002). The online mode, however, is much less popular in
applications of machine learning than the batch mode of prediction. This paper
follows the recent papers by Lei et al. (2011), Lei and Wasserman (2012), and
Lei et al. (2012) studying properties of conformal prediction in the batch mode;
we, however, concentrate on inductive conformal prediction (also considered in
Lei et al. 2012). The performance of inductive conformal predictors in the batch
mode is illustrated using the well-known Spambase data set; for earlier empirical
studies of conformal prediction in the batch mode see, e.g., Vanderlooy et al.
(2007). The conference version of this paper is published as Vovk (2012).

Figure 1: Eight notions of conditional validity. The visible vertices of the cube
are U (unconditional), T (training conditional), O (object conditional), L (label
conditional), OL (example conditional), TL (training and label conditional), TO
(training and object conditional). The invisible vertex is TOL (and corresponds
to conditioning on everything).

We will usually be making the assumption of randomness, which is stan- 
dard in machine learning and nonparametric statistics: the available data is a 
sequence of examples generated independently from the same probability distri- 
bution P. (In some cases we will make the weaker assumption of exchangeability; 
for some of our results even weaker assumptions, such as conditional random- 
ness or exchangeability, would have been sufficient.) Each example consists of 
two components: an object and a label. We are given a training set of examples 
and a new object, and our goal is to predict the label of the new object. (If we 
have a whole test set of new objects, we can apply the procedure for predicting 
one new object to each of the objects in the test set.) 

The two desiderata for inductive conformal predictors are their validity and 
efficiency: validity requires that the coverage probability of the prediction sets 
should be at least equal to a preset confidence level, and efficiency requires that 
the prediction sets should be as small as possible. However, there is a wide 
variety of notions of validity, since the "coverage probability" is, in general, 
conditional probability. The simplest case is where we condition on the trivial
$\sigma$-algebra, i.e., the probability is in fact unconditional probability, but several
other notions of conditional validity are depicted in Figure 1, where T refers to
conditioning on the training set, O to conditioning on the test object, and L
to conditioning on the test label. The arrows in Figure 1 lead from stronger to
weaker notions of conditional validity; U is the sink and TOL is the source (the
latter is not shown).

Inductive conformal predictors will be defined in Section 2. They are
automatically valid, in the sense of unconditional validity. It should be said
that, in general, the unconditional error probability is easier to deal with than
conditional error probabilities; e.g., the standard statistical methods of
cross-validation and bootstrap provide decent estimates of the unconditional error
probability but poor estimates for the training conditional error probability:
see Hastie et al. (2009), Section 7.12.

In Section 3 we explore training conditional validity of inductive conformal
predictors. Our simple results (Propositions 2a and 2b) are of the PAC type,
involving two parameters: the target training conditional coverage probability
$1 - \epsilon$ and the probability $1 - \delta$ with which $1 - \epsilon$ is attained. They show that
inductive conformal predictors achieve training conditional validity automatically
(whereas for other notions of conditional validity the method has to be
modified). We give self-contained proofs of Propositions 2a and 2b, but Appendix A
explains how they can be deduced from classical results about tolerance regions.

In the following section, Section 4, we introduce a conditional version of
inductive conformal predictors and explain, in particular, how it achieves label
conditional validity. Label conditional validity is important as it allows the
learner to control the set-prediction analogues of false positive and false negative
rates. Section 5 is about object conditional validity and its main result (a
version of a lemma in Lei and Wasserman 2012) is negative: precise object
conditional validity cannot be achieved in a useful way unless the test object has
a positive probability. Whereas precise object conditional validity is usually not
achievable, we should aim for approximate and asymptotic object conditional
validity when given enough data (cf. Lei and Wasserman 2012).

Section 6 reports on the results of empirical studies for the standard
Spambase data set (see, e.g., Hastie et al. 2009, Chapter 1, Example 1, and
Section 9.1.2). Section 7 discusses close connections between an important
class of ICPs and ROC curves. Section 8 concludes and Appendix A discusses
connections with the classical theory of tolerance regions (in particular, it
explains how Propositions 2a and 2b can be deduced from classical results
about tolerance regions).



2 Inductive conformal predictors 

The example space will be denoted $\mathbf{Z}$; it is the Cartesian product $\mathbf{X} \times \mathbf{Y}$ of
two measurable spaces, the object space and the label space. In other words,
each example $z \in \mathbf{Z}$ consists of two components: $z = (x, y)$, where $x \in \mathbf{X}$ is
its object and $y \in \mathbf{Y}$ is its label. Two important special cases are the problem
of classification, where $\mathbf{Y}$ is a finite set (equipped with the discrete $\sigma$-algebra),
and the problem of regression, where $\mathbf{Y} = \mathbb{R}$.

Let $(z_1, \ldots, z_l)$ be the training set, $z_i = (x_i, y_i) \in \mathbf{Z}$. We split it into two
parts, the proper training set $(z_1, \ldots, z_m)$ of size $m < l$ and the calibration set
of size $l - m$. An inductive conformity $m$-measure is a measurable function
$A : \mathbf{Z}^m \times \mathbf{Z} \to \mathbb{R}$; the idea behind the conformity score $A((z_1, \ldots, z_m), z)$ is that
it should measure how well $z$ conforms to the proper training set. A standard
choice is

$$A((z_1, \ldots, z_m), (x, y)) := \Delta(y, f(x)), \tag{1}$$

where $f : \mathbf{X} \to \mathbf{Y}'$ is a prediction rule found from $(z_1, \ldots, z_m)$ as the training set
and $\Delta : \mathbf{Y} \times \mathbf{Y}' \to \mathbb{R}$ is a measure of similarity between a label and a prediction.
Allowing $\mathbf{Y}'$ to be different from $\mathbf{Y}$ (often $\mathbf{Y}' \supset \mathbf{Y}$) may be useful when the
underlying prediction method gives additional information to the predicted
label; e.g., the MART procedure used in Section 6 gives the logit of the predicted
probability that the label is 1.

Remark. The idea behind the term "calibration set" is that this set allows us 
to calibrate the conformity scores for test examples by translating them into a 
probability-type scale. 

The inductive conformal predictor (ICP) corresponding to $A$ is defined as
the set predictor

$$\Gamma^\epsilon(z_1, \ldots, z_l, x) := \{ y \mid p^y > \epsilon \}, \tag{2}$$

where $\epsilon \in [0, 1]$ is the chosen significance level ($1 - \epsilon$ is known as the confidence
level), the p-values $p^y$, $y \in \mathbf{Y}$, are defined by

$$p^y := \frac{|\{ i = m+1, \ldots, l \mid \alpha_i \le \alpha^y \}| + 1}{l - m + 1}, \tag{3}$$

and

$$\alpha_i := A((z_1, \ldots, z_m), z_i), \quad i = m+1, \ldots, l, \qquad \alpha^y := A((z_1, \ldots, z_m), (x, y)) \tag{4}$$

are the conformity scores. Given the training set and a new object $x$ the ICP
predicts its label $y$; it makes an error if $y \notin \Gamma^\epsilon(z_1, \ldots, z_l, x)$.
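The definitions (2)-(4) can be sketched in a few lines of Python. This is only a minimal illustration that assumes the conformity scores have already been computed by some inductive conformity measure; the function names `icp_p_values` and `icp_prediction_set` and the toy numbers are hypothetical and do not come from the paper.

```python
from typing import Dict, Hashable, Sequence, Set


def icp_p_values(
    calibration_scores: Sequence[float],
    test_scores: Dict[Hashable, float],
) -> Dict[Hashable, float]:
    """p-values as in (3): (|{i : alpha_i <= alpha^y}| + 1) / (l - m + 1)."""
    n = len(calibration_scores)  # n = l - m, the size of the calibration set
    return {
        y: (sum(1 for a in calibration_scores if a <= alpha_y) + 1) / (n + 1)
        for y, alpha_y in test_scores.items()
    }


def icp_prediction_set(
    calibration_scores: Sequence[float],
    test_scores: Dict[Hashable, float],
    epsilon: float,
) -> Set[Hashable]:
    """Prediction set as in (2): all labels y whose p-value exceeds epsilon."""
    p = icp_p_values(calibration_scores, test_scores)
    return {y for y, p_y in p.items() if p_y > epsilon}


# Toy data (assumed): scores alpha_i of four calibration examples, and the
# scores alpha^y of one test object paired with each candidate label y.
calibration = [0.2, 1.5, 2.0, 3.1]
candidates = {"email": 2.4, "spam": -1.0}
p = icp_p_values(calibration, candidates)
prediction = icp_prediction_set(calibration, candidates, epsilon=0.25)
```

In this toy case three calibration scores lie at or below the "email" score and none at or below the "spam" score, so only "email" survives at significance level 0.25.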

The random variables whose realizations are $x_i$, $y_i$, $z_i$, $z$ will be denoted by
the corresponding upper case letters ($X_i$, $Y_i$, $Z_i$, $Z$, respectively). The following
proposition of validity is almost obvious.

Proposition 1 (Vovk et al. 2005, Proposition 4.1). If random examples
$Z_{m+1}, \ldots, Z_l, Z_{l+1} = (X_{l+1}, Y_{l+1})$ are exchangeable (i.e., their
distribution is invariant under permutations), the probability of error
$Y_{l+1} \notin \Gamma^\epsilon(Z_1, \ldots, Z_l, X_{l+1})$ does not exceed $\epsilon$ for any $\epsilon$ and any inductive conformal
predictor $\Gamma$.

In practice the probability of error is usually close to $\epsilon$ (as we will see in
Section 6).



3 Training conditional validity 

As discussed in Section 1, the property of validity of inductive conformal
predictors is unconditional. The property of conditional validity can be formalized
using a PAC-type 2-parameter definition. It will be convenient to represent
the ICP (2) in a slightly different form downplaying the structure $(x_i, y_i)$ of $z_i$.
Define $\Gamma^\epsilon(z_1, \ldots, z_l) := \{ (x, y) \mid p^y > \epsilon \}$, where $p^y$ is defined, as before, by (3)
and (4) (therefore, $p^y$ depends implicitly on $x$). Proposition 1 can be restated
by saying that the probability of error $Z_{l+1} \notin \Gamma^\epsilon(Z_1, \ldots, Z_l)$ does not exceed $\epsilon$
provided $Z_1, \ldots, Z_{l+1}$ are exchangeable.

We consider a canonical probability space in which $Z_i = (X_i, Y_i)$,
$i = 1, \ldots, l + 1$, are i.i.d. random examples. A set predictor $\Gamma$ (outputting a subset
of $\mathbf{Z}$ given $l$ examples and measurable in a suitable sense) is $(\epsilon, \delta)$-valid if, for
any probability distribution $P$ on $\mathbf{Z}$,

$$P^l \bigl( P(\Gamma(Z_1, \ldots, Z_l)) \ge 1 - \epsilon \bigr) \ge 1 - \delta.$$

It is easy to see that ICPs satisfy this property for suitable $\epsilon$ and $\delta$.

Proposition 2a. Suppose $\epsilon, \delta, E \in [0, 1]$,

$$E \ge \epsilon + \sqrt{\frac{-\ln \delta}{2n}}, \tag{5}$$

where $n := l - m$ is the size of the calibration set, and $\Gamma$ is an inductive conformal
predictor. The set predictor $\Gamma^\epsilon$ is then $(E, \delta)$-valid. Moreover, for any probability
distribution $P$ on $\mathbf{Z}$ and any proper training set $(z_1, \ldots, z_m) \in \mathbf{Z}^m$,

$$P^n \bigl( P(\Gamma^\epsilon(z_1, \ldots, z_m, Z_{m+1}, \ldots, Z_l)) \ge 1 - E \bigr) \ge 1 - \delta.$$

This proposition gives the following recipe for constructing $(\epsilon, \delta)$-valid set
predictors. The recipe only works if the training set is sufficiently large; in
particular, its size $l$ should significantly exceed $N := (-\ln \delta)/(2\epsilon^2)$. Choose an
ICP $\Gamma$ with the size $n$ of the calibration set exceeding $N$. Then the set predictor
$\Gamma^{\epsilon - \sqrt{(-\ln \delta)/(2n)}}$ will be $(\epsilon, \delta)$-valid.

Proof of Proposition 2a. Let $E \in (\epsilon, 1)$ (not necessarily satisfying (5)). Fix the
proper training set $(z_1, \ldots, z_m)$. By (2) and (3), the set predictor $\Gamma^\epsilon$ makes an
error, $(x, y) \notin \Gamma^\epsilon(z_1, \ldots, z_l)$, if and only if the number of $i = m + 1, \ldots, l$ such
that $\alpha_i \le \alpha^y$ is at most $\lfloor \epsilon(n + 1) - 1 \rfloor$; in other words, if and only if $\alpha^y < \alpha_{(k)}$,
where $\alpha_{(k)}$ is the $k$th smallest $\alpha_i$ and $k := \lfloor \epsilon(n + 1) - 1 \rfloor + 1$. Therefore, the
$P$-probability of the complement of $\Gamma^\epsilon(z_1, \ldots, z_l)$ is $P(A((z_1, \ldots, z_m), Z) < \alpha_{(k)})$,
where $A$ is the inductive conformity $m$-measure. Set

$$\alpha^* := \inf \{ \alpha \mid P(A((z_1, \ldots, z_m), Z) \le \alpha) \ge E \},$$

$$E' := P(A((z_1, \ldots, z_m), Z) < \alpha^*), \qquad E'' := P(A((z_1, \ldots, z_m), Z) \le \alpha^*).$$

The $\sigma$-additivity of measures implies that $E' \le E \le E''$, and $E' = E = E''$
unless $\alpha^*$ is an atom of $A((z_1, \ldots, z_m), Z)$. Both when $E' = E$ and when
$E' < E$, the probability of error will exceed $E$ if and only if $\alpha_{(k)} > \alpha^*$. In other
words, if and only if we have at most $k - 1$ of the $\alpha_i$ below or equal to $\alpha^*$. The
probability that at most $k - 1 = \lfloor \epsilon(n + 1) - 1 \rfloor$ values of the $\alpha_i$ are below or
equal to $\alpha^*$ equals $\mathbb{P}(B''_n \le \lfloor \epsilon(n + 1) - 1 \rfloor) \le \mathbb{P}(B_n \le \lfloor \epsilon(n + 1) - 1 \rfloor)$, where
$B''_n \sim \mathrm{bin}_{n,E''}$, $B_n \sim \mathrm{bin}_{n,E}$, and $\mathrm{bin}_{n,p}$ stands for the binomial distribution with
$n$ trials and probability of success $p$. (For the inequality, see Lemma 1 below.)
By Hoeffding's inequality (see, e.g., Vovk et al. 2005, p. 287), the probability of
error will exceed $E$ with probability at most

$$\mathbb{P}(B_n \le \lfloor \epsilon(n + 1) - 1 \rfloor) \le \mathbb{P}(B_n \le \epsilon n) = \mathbb{P}(B_n / n - E \le \epsilon - E) \le e^{-2 (E - \epsilon)^2 n}. \tag{6}$$

Solving $e^{-2 (E - \epsilon)^2 n} = \delta$ we obtain that $\Gamma^\epsilon$ is $(E, \delta)$-valid whenever (5) is satisfied.

□



In the proof of Proposition 2a we used the following lemma.

Lemma 1. Fix the number of trials $n$. The distribution function $\mathrm{bin}_{n,p}(K)$ of
the binomial distribution is decreasing in the probability of success $p$ for a fixed
$K \in \{0, \ldots, n\}$.

Proof. It suffices to check that

$$\frac{d\, \mathrm{bin}_{n,p}(K)}{dp} = \frac{d}{dp} \sum_{k=0}^{K} \binom{n}{k} p^k (1 - p)^{n-k} = \sum_{k=0}^{K} \frac{k - np}{p (1 - p)} \binom{n}{k} p^k (1 - p)^{n-k}$$

is nonpositive for $p \in (0, 1)$. The last sum has the same sign as the mean of the
function $f(k) := k - np$ over the set $k \in \{0, \ldots, K\}$ with respect to the binomial
distribution, and so it remains to notice that the overall mean of $f$ is 0 and that
the function $f$ is increasing. □

The inequality (5) in Proposition 2a is simple but somewhat crude as its
derivation uses Hoeffding's inequality. The following proposition is the more
precise version of Proposition 2a that stops short of that last step.

Proposition 2b. Let $\epsilon, \delta, E \in [0, 1]$. If $\Gamma$ is an inductive conformal predictor,
the set predictor $\Gamma^\epsilon$ is $(E, \delta)$-valid provided

$$\delta \ge \mathrm{bin}_{n,E}(\lfloor \epsilon(n + 1) - 1 \rfloor), \tag{7}$$

where $n := l - m$ is the size of the calibration set and $\mathrm{bin}_{n,E}$ is the cumulative
binomial distribution function with $n$ trials and probability of success $E$. If the
random variable $A((z_1, \ldots, z_m), Z)$ is continuous, $\Gamma^\epsilon$ is $(E, \delta)$-valid if and only
if (7) holds.

Proof. See the left-most expression in (6) and remember that $E'' = E$ unless
$\alpha^*$ is an atom of $A((z_1, \ldots, z_m), Z)$. □
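To get a feel for how much tighter the binomial condition of Proposition 2b is than the Hoeffding-based condition of Proposition 2a, the two can be compared numerically. The following Python sketch is an illustration only: the helper names (`binom_cdf`, `hoeffding_E`, `binomial_E`), the bisection search, and the choice $\delta = 0.05$ are assumptions made here, while $n = 999$ and $\epsilon = 0.05$ match the calibration set size and significance level used in Section 6.

```python
import math


def binom_cdf(k: int, n: int, p: float) -> float:
    """Cumulative binomial distribution function bin_{n,p}(k)."""
    if k < 0:
        return 0.0
    if k >= n or p <= 0.0:
        return 1.0
    if p >= 1.0:
        return 0.0  # here k < n, so "at most k successes" is impossible
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for j in range(k + 1):
        # log of C(n, j) p^j (1 - p)^(n - j); lgamma avoids overflow
        log_term = (math.lgamma(n + 1) - math.lgamma(j + 1)
                    - math.lgamma(n - j + 1) + j * log_p + (n - j) * log_q)
        total += math.exp(log_term)
    return min(total, 1.0)


def hoeffding_E(eps: float, delta: float, n: int) -> float:
    """Smallest E allowed by the Hoeffding-based inequality (5)."""
    return eps + math.sqrt(-math.log(delta) / (2 * n))


def binomial_E(eps: float, delta: float, n: int, tol: float = 1e-10) -> float:
    """Approximately the smallest E satisfying the binomial inequality (7)."""
    K = math.floor(eps * (n + 1) - 1)
    if binom_cdf(K, n, 0.0) <= delta:
        return 0.0
    if binom_cdf(K, n, 1.0) > delta:
        return 1.0  # (7) never holds; E = 1 is trivially valid
    lo, hi = 0.0, 1.0  # (7) fails at lo and holds at hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        # bin_{n,E}(K) is decreasing in E (Lemma 1), so bisection applies
        if binom_cdf(K, n, mid) <= delta:
            hi = mid
        else:
            lo = mid
    return hi


n, eps, delta = 999, 0.05, 0.05  # delta is an assumed illustrative value
E_hoeffding = hoeffding_E(eps, delta, n)
E_binomial = binomial_E(eps, delta, n)
print(f"Hoeffding bound (5): E = {E_hoeffding:.4f}")
print(f"Binomial bound (7):  E = {E_binomial:.4f}")
```

The bisection relies on the monotonicity established in Lemma 1; any other root finder would serve equally well.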

Remark. The training conditional guarantees discussed in this section are very
similar to those for the hold-out estimate: compare, e.g., Proposition 2b above
and Theorem 3.3 in Langford (2005). The former says that $\Gamma^\epsilon$ is $(E, \delta)$-valid for

$$E := \overline{\mathrm{bin}}_{n,\delta}(\lfloor \epsilon(n + 1) - 1 \rfloor) \le \overline{\mathrm{bin}}_{n,\delta}(\epsilon n), \tag{8}$$

where $\overline{\mathrm{bin}}$ is the inverse function to $\mathrm{bin}$:

$$\overline{\mathrm{bin}}_{n,\delta}(k) := \max \{ p \mid \mathrm{bin}_{n,p}(k) \ge \delta \}$$

(unless $k = n$, we can also say that $\overline{\mathrm{bin}}_{n,\delta}(k)$ is the only value of $p$ such that
$\mathrm{bin}_{n,p}(k) = \delta$: cf. Lemma 1 above). And the latter says that a point predictor's
error probability (over the test example) does not exceed

$$\overline{\mathrm{bin}}_{n,\delta}(k) \tag{9}$$

with probability at least $1 - \delta$ (over the training set), where $k$ is the number
of errors on a held-out set of size $n$. The main difference between (8) and (9)
is that whereas one inequality contains the approximate expected number of
errors $\epsilon n$ for $n$ new examples the other contains the actual number of errors
$k$ on $n$ examples. Several researchers have found that the hold-out estimate
is surprisingly difficult to beat; however, like the ICP of this section, it is not
example conditional at all.

Remark. Inequality (7) can be rewritten as

$$E \ge \overline{\mathrm{bin}}_{n,\delta}(\lfloor \epsilon(n + 1) - 1 \rfloor).$$

In combination with inequality 2. in Langford (2005), p. 278, this shows that
Proposition 2a will continue to hold if (5) is replaced by

$$E \ge \epsilon + \sqrt{\frac{-2 \epsilon \ln \delta}{n}} - \frac{2 \ln \delta}{n}.$$

The last inequality is weaker than (5) for small $\epsilon$.
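The claim about small $\epsilon$ can be checked with a short calculation. The Python sketch below compares the slack $E - \epsilon$ required by the Hoeffding-based condition, $\sqrt{(-\ln \delta)/(2n)}$, with the slack required by the alternative condition of this remark, $\sqrt{(-2 \epsilon \ln \delta)/n} - (2 \ln \delta)/n$. The function names and the values $n = 999$, $\delta = 0.05$ are illustrative assumptions.

```python
import math


def hoeffding_slack(delta: float, n: int) -> float:
    """E - eps required by the Hoeffding-based condition (5)."""
    return math.sqrt(-math.log(delta) / (2 * n))


def alternative_slack(eps: float, delta: float, n: int) -> float:
    """E - eps required by the alternative condition of this remark."""
    return math.sqrt(-2 * eps * math.log(delta) / n) - 2 * math.log(delta) / n


n, delta = 999, 0.05  # assumed illustrative values
for eps in (0.01, 0.05, 0.2):
    print(eps, round(hoeffding_slack(delta, n), 4),
          round(alternative_slack(eps, delta, n), 4))
```

With these values the alternative condition demands less slack at $\epsilon = 0.01$ and $\epsilon = 0.05$ but more at $\epsilon = 0.2$, which is consistent with it being the weaker requirement only for small $\epsilon$.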



4 Conditional inductive conformal predictors 

The motivation behind conditional inductive conformal predictors is that
ICPs do not always achieve the required probability $\epsilon$ of error
$Y_{l+1} \notin \Gamma^\epsilon(Z_1, \ldots, Z_l, X_{l+1})$ conditional on $(X_{l+1}, Y_{l+1}) \in E$ for important sets $E \subseteq \mathbf{Z}$.
This is often undesirable. If, e.g., our set predictor is valid at the significance
level 5% but makes an error with probability 10% for men and 0% for women,
both men and women can be unhappy with calling 5% the probability of error.
Moreover, in many problems we might want different significance levels for
different regions of the example space: e.g., in the problem of spam detection
(considered in Section 6) classifying spam as email usually does much less
harm than classifying email as spam.

An inductive $m$-taxonomy is a measurable function $K : \mathbf{Z}^m \times \mathbf{Z} \to \mathbf{K}$, where
$\mathbf{K}$ is a measurable space. Usually the category $K((z_1, \ldots, z_m), z)$ of an example
$z$ is a kind of classification of $z$, which may depend on the proper training set
$(z_1, \ldots, z_m)$.

The conditional inductive conformal predictor (conditional ICP) corresponding
to $K$ and an inductive conformity $m$-measure $A$ is defined as the set predictor
(2), where the p-values $p^y$ are now defined by

$$p^y := \frac{|\{ i = m+1, \ldots, l \mid \kappa_i = \kappa^y \;\&\; \alpha_i \le \alpha^y \}| + 1}{|\{ i = m+1, \ldots, l \mid \kappa_i = \kappa^y \}| + 1}, \tag{10}$$

the categories $\kappa$ are defined by

$$\kappa_i := K((z_1, \ldots, z_m), z_i), \quad i = m+1, \ldots, l, \qquad \kappa^y := K((z_1, \ldots, z_m), (x, y)),$$

and the conformity scores $\alpha$ are defined as before by (4). A label conditional
ICP is a conditional ICP with the inductive $m$-taxonomy $K(\cdot, (x, y)) := y$.
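The only change relative to the unconditional ICP is that each candidate label is compared against the calibration examples of its own category. A minimal Python sketch of the conditional p-values follows; the function name `conditional_icp_p_values` and the toy data are hypothetical, and the scores are assumed to be precomputed.

```python
from typing import Dict, Hashable, Sequence


def conditional_icp_p_values(
    cal_scores: Sequence[float],
    cal_categories: Sequence[Hashable],
    test_scores: Dict[Hashable, float],
    test_categories: Dict[Hashable, Hashable],
) -> Dict[Hashable, float]:
    """Conditional p-values: only calibration examples whose category
    kappa_i equals the category kappa^y of (x, y) are counted."""
    p = {}
    for y, alpha_y in test_scores.items():
        kappa_y = test_categories[y]
        same = [a for a, k in zip(cal_scores, cal_categories) if k == kappa_y]
        p[y] = (sum(1 for a in same if a <= alpha_y) + 1) / (len(same) + 1)
    return p


# Label conditional ICP: the category of an example is simply its label.
cal_scores = [0.2, 1.5, 2.0, 3.1]
cal_labels = ["email", "email", "spam", "spam"]
test_scores = {"email": 1.8, "spam": 0.5}
test_categories = {y: y for y in test_scores}
p = conditional_icp_p_values(cal_scores, cal_labels, test_scores, test_categories)
```

Here the "email" score is compared only with the two email calibration scores and the "spam" score only with the two spam ones.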

The following proposition is the conditional analogue of Proposition 1; in
particular, it shows that in classification problems label conditional ICPs achieve
label conditional validity.

Proposition 3. If random examples $Z_{m+1}, \ldots, Z_l, Z_{l+1} = (X_{l+1}, Y_{l+1})$ are
exchangeable, the probability of error $Y_{l+1} \notin \Gamma^\epsilon(Z_1, \ldots, Z_l, X_{l+1})$ given the
category $K((Z_1, \ldots, Z_m), Z_{l+1})$ of $Z_{l+1}$ does not exceed $\epsilon$ for any $\epsilon$ and any
conditional inductive conformal predictor $\Gamma$ corresponding to $K$.



5 Object conditional validity 



In this section we prove a negative result (a version of Lemma 1 in Lei and
Wasserman 2012) which says that the requirement of precise object conditional
validity cannot be satisfied in a non-trivial way for rich object spaces (such as
$\mathbb{R}$). If $P$ is a probability distribution on $\mathbf{Z}$, we let $P_{\mathbf{X}}$ stand for its marginal
distribution on $\mathbf{X}$: $P_{\mathbf{X}}(A) := P(A \times \mathbf{Y})$. Let us say that a set predictor $\Gamma$
has $1 - \epsilon$ object conditional validity, where $\epsilon \in (0, 1)$, if, for all probability
distributions $P$ on $\mathbf{Z}$ and $P_{\mathbf{X}}$-almost all $x \in \mathbf{X}$,

$$P^{l+1} \bigl( Y_{l+1} \in \Gamma(Z_1, \ldots, Z_l, X_{l+1}) \mid X_{l+1} = x \bigr) \ge 1 - \epsilon. \tag{11}$$

The Lebesgue measure on $\mathbb{R}$ will be denoted $\Lambda$. If $Q$ is a probability distribution,
we say that a property $F$ holds for $Q$-almost all elements of a set $E$ if $Q(E \setminus F) = 0$;
a $Q$-non-atom is an element $x$ such that $Q(\{x\}) = 0$.

Proposition 4. Suppose $\mathbf{X}$ is a separable metric space equipped with the Borel
$\sigma$-algebra. Let $\epsilon \in (0, 1)$. Suppose that a set predictor $\Gamma$ has $1 - \epsilon$ object
conditional validity. In the case of regression, we have, for all $P$ and for $P_{\mathbf{X}}$-almost
all $P_{\mathbf{X}}$-non-atoms $x \in \mathbf{X}$,

$$P^l \bigl( \Lambda(\Gamma(Z_1, \ldots, Z_l, x)) = \infty \bigr) \ge 1 - \epsilon. \tag{12}$$

In the case of classification, we have, for all $P$, all $y \in \mathbf{Y}$, and $P_{\mathbf{X}}$-almost all
$P_{\mathbf{X}}$-non-atoms $x$,

$$P^l \bigl( y \in \Gamma(Z_1, \ldots, Z_l, x) \bigr) \ge 1 - \epsilon. \tag{13}$$

We are mainly interested in the case of a small $\epsilon$ (corresponding to high
confidence), and in this case (12) implies that, in the case of regression,
prediction intervals (i.e., the convex hulls of prediction sets) can be expected to be
infinitely long unless the new object is an atom. In the case of classification,
(13) says that each particular $y \in \mathbf{Y}$ is likely to be included in the prediction
set, and so the prediction set is likely to be large. In particular, (13) implies
that the expected size of the prediction set is at least $(1 - \epsilon) |\mathbf{Y}|$.
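The bound on the expected size follows from the classification inequality of Proposition 4 by linearity of expectation. A short derivation, written here for a fixed non-atom $x$ and with $\mathbb{E}$ denoting expectation with respect to $P^l$ (notation assumed for this illustration):

```latex
\mathbb{E} \, |\Gamma(Z_1, \ldots, Z_l, x)|
  = \mathbb{E} \sum_{y \in \mathbf{Y}} 1_{\{ y \in \Gamma(Z_1, \ldots, Z_l, x) \}}
  = \sum_{y \in \mathbf{Y}} P^l \bigl( y \in \Gamma(Z_1, \ldots, Z_l, x) \bigr)
  \ge (1 - \epsilon) \, |\mathbf{Y}| .
```

For example, in a binary problem with $\epsilon = 0.05$ the expected size of the prediction set at such an $x$ is at least $1.9$, so the predictor must output both labels most of the time.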

Of course, the condition that $x$ be a non-atom is essential: if $P_{\mathbf{X}}(\{x\}) > 0$,
an inductive conformal predictor that ignores all examples with objects different
from $x$ will have $1 - \epsilon$ object conditional validity and can give narrow predictions
if the training set is big enough to contain many examples with $x$ as their object.

Remark. Nontrivial set predictors having $1 - \epsilon$ object conditional validity are
constructed by McCullagh et al. (2009) assuming the Gauss linear model.

Proof of Proposition 4. The proof will be based on the ideas of Lei and
Wasserman (2012, the proof of Lemma 1).

Suppose (12) does not hold on a measurable set $E$ of $P_{\mathbf{X}}$-non-atoms $x \in \mathbf{X}$
such that $P_{\mathbf{X}}(E) > 0$. Shrink $E$ in such a way that $P_{\mathbf{X}}(E) > 0$ still holds but
there exist $\delta > 0$ and $C > 0$ such that, for each $x \in E$,

$$P^l \bigl( \Lambda(\Gamma(Z_1, \ldots, Z_l, x)) \le C \bigr) > \epsilon + \delta. \tag{14}$$

Let $V$ be the total variation distance between probability measures,
$V(P, Q) := \sup_A |P(A) - Q(A)|$; we then have

$$V(P^l, Q^l) \le \sqrt{2} \sqrt{1 - (1 - V(P, Q))^l}$$

(this follows from the connection of $V$ with the Hellinger distance: see, e.g.,
Tsybakov 2010, Section 2.4). Shrink $E$ further so that $P_{\mathbf{X}}(E) > 0$ still holds
but

$$\sqrt{2} \sqrt{1 - (1 - P_{\mathbf{X}}(E))^l} \le \delta / 2. \tag{15}$$



(This can be done under our assumption that $\mathbf{X}$ is a separable metric space:
see Lemma 2 below.) Define another probability distribution $Q$ on $\mathbf{Z}$ by the
requirements that $Q(A \times B) = P(A \times B)$ for all measurable $A \subseteq (\mathbf{X} \setminus E)$,
$B \subseteq \mathbb{R}$ and $Q(A \times B) = P_{\mathbf{X}}(A) \times U(B)$ for all measurable $A \subseteq E$, $B \subseteq \mathbb{R}$,
where $U$ is the uniform probability distribution on the interval $[-DC, DC]$ and
$D > 0$ will be chosen below. Since $V(P, Q) \le P_{\mathbf{X}}(E)$, we have $V(P^l, Q^l) \le \delta / 2$;
therefore, by (14),

$$Q^l \bigl( \Lambda(\Gamma(Z_1, \ldots, Z_l, x)) \le C \bigr) > \epsilon + \delta / 2$$

for each $x \in E$. The last inequality implies, by Fubini's theorem,

$$Q^{l+1} \bigl( \Lambda(\Gamma(Z_1, \ldots, Z_l, X_{l+1})) \le C \;\&\; X_{l+1} \in E \bigr) > (\epsilon + \delta / 2) \, Q_{\mathbf{X}}(E),$$

where $Q_{\mathbf{X}}(E) = P_{\mathbf{X}}(E) > 0$ is the marginal $Q$-probability of $E$. When
$D = D(\delta Q_{\mathbf{X}}(E), C)$ is sufficiently large this in turn implies

$$Q^{l+1} \bigl( Y_{l+1} \notin \Gamma(Z_1, \ldots, Z_l, X_{l+1}) \;\&\; X_{l+1} \in E \bigr) > (\epsilon + \delta / 4) \, Q_{\mathbf{X}}(E).$$

However, the last inequality contradicts

$$\frac{Q^{l+1} \bigl( Y_{l+1} \notin \Gamma(Z_1, \ldots, Z_l, X_{l+1}) \;\&\; X_{l+1} \in E \bigr)}{Q_{\mathbf{X}}(E)} \le \epsilon,$$

which follows from $\Gamma$ having $1 - \epsilon$ object conditional validity and the definition
of conditional probability.



It remains to consider the case of classification. Suppose (13) does not hold
on a measurable set $E$ of $P_{\mathbf{X}}$-non-atoms $x \in \mathbf{X}$ such that $P_{\mathbf{X}}(E) > 0$. Shrink $E$
in such a way that $P_{\mathbf{X}}(E) > 0$ still holds but there exists $\delta > 0$ such that, for
each $x \in E$,

$$P^l \bigl( y \in \Gamma(Z_1, \ldots, Z_l, x) \bigr) \le 1 - \epsilon - \delta.$$

Without loss of generality we further assume that (15) also holds. Define a
probability distribution $Q$ on $\mathbf{Z}$ by the requirements that $Q(A \times B) = P(A \times B)$
for all measurable $A \subseteq (\mathbf{X} \setminus E)$ and all $B \subseteq \mathbf{Y}$ and that $Q(A \times \{y\}) = P_{\mathbf{X}}(A)$
for all measurable $A \subseteq E$ (i.e., modify $P$ setting the conditional distribution of
$Y$ given $X \in E$ to the unit mass concentrated at $y$). Then for each $x \in E$ we
have

$$Q^l \bigl( y \in \Gamma(Z_1, \ldots, Z_l, x) \bigr) \le 1 - \epsilon - \delta / 2,$$

which implies

$$Q^{l+1} \bigl( Y_{l+1} \in \Gamma(Z_1, \ldots, Z_l, X_{l+1}) \;\&\; X_{l+1} \in E \bigr) \le (1 - \epsilon - \delta / 2) \, Q_{\mathbf{X}}(E).$$

The last inequality contradicts $\Gamma$ having $1 - \epsilon$ object conditional validity. □

In the proof of Proposition 4 we used the following lemma.

Lemma 2. If $Q$ is a probability measure on $\mathbf{X}$, which is a separable metric space,
$E$ is a set of $Q$-non-atoms such that $Q(E) > 0$, and $\delta > 0$ is an arbitrarily small
number, then there is $E' \subseteq E$ such that $0 < Q(E') < \delta$.

Proof. We can take the intersection of $E$ and an open ball centered at any
element of $\mathbf{X}$ for which all such intersections have a positive $Q$-probability. Let
us prove that such elements exist. Suppose they do not.

Fix a countable dense subset $A_1$ of $\mathbf{X}$. Let $A_2$ be the union of all open balls
$B$ with rational radii centered at points in $A_1$ such that $Q(B \cap E) = 0$. On
one hand, the $\sigma$-additivity of measures implies $Q(A_2 \cap E) = 0$. On the other
hand, $A_2 = \mathbf{X}$: indeed, for each $x \in \mathbf{X}$ there is an open ball $B$ of some radius
$\delta > 0$ centered at $x$ that satisfies $Q(B \cap E) = 0$; since $x$ belongs to the radius
$\delta / 2$ open ball centered at a point in $A_1$ at a distance of less than $\delta / 2$ from $x$,
we have $x \in A_2$. This contradicts $Q(E) > 0$. □

Proposition 4 can be extended to randomized set predictors $\Gamma$ (in which
case $P^l$ and $P^{l+1}$ in expressions such as (11) and (12) should be replaced by the
probability distribution comprising both $P$ and the internal coin tossing of $\Gamma$).
This clarifies the provenance of $\epsilon$ in (12) and (13): $\epsilon$ cannot be replaced by a
smaller constant since the set predictor predicting $\mathbf{Y}$ with probability $1 - \epsilon$ and
$\emptyset$ with probability $\epsilon$ has $1 - \epsilon$ object conditional validity.

Proposition 4 does not prevent the existence of efficient set predictors
that are conditionally valid in an asymptotic sense; indeed, the paper by Lei
and Wasserman (2012) is devoted to constructing asymptotically efficient and
asymptotically conditionally valid set predictors in the case of regression.



6 Experiments



This section describes some simple experiments on the well-known Spambase
data set contributed by George Forman to the UCI Machine Learning Repository
(Frank and Asuncion 2010). Its overall size is 4601 examples and it contains
examples of two classes: email (also written as 0) and spam (also written as
1). Hastie et al. (2009) report results of several machine-learning algorithms on
this data set split randomly into a training set of size 3065 and test set of size
1536. The best result is achieved by MART (multiple additive regression tree;
4.5% error rate according to the second edition of Hastie et al. 2009).



We randomly permute the data set and divide it into 2602 examples for
the proper training set, 999 for the calibration set, and 1000 for the test set.
Our split between the proper training, calibration, and test sets, approximately
2:1:1, is inspired by the standard recommendation for the allocation of data into
training, validation, and test sets (see, e.g., Hastie et al. 2009, Section 7.2). We
consider the ICP whose conformity measure is defined by (1) where $f$ is output
by MART and

$$\Delta(y, f(x)) := \begin{cases} f(x) & \text{if } y = 1 \\ -f(x) & \text{if } y = 0. \end{cases} \tag{16}$$

MART's output $f(x)$ models the log-odds of spam vs email,

$$f(x) = \log \frac{P(1 \mid x)}{P(0 \mid x)},$$

which makes the interpretation of (16) as conformity score very natural.
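As a small illustration of why this choice is natural, the following Python sketch computes the similarity measure for both candidate labels from a log-odds value. The helper names `log_odds` and `conformity` and the probability 0.9 are assumptions made for the example; in the experiments the log-odds would come from MART rather than from a known probability.

```python
import math


def log_odds(p_spam: float) -> float:
    """Log-odds of spam vs email for a predicted spam probability."""
    return math.log(p_spam / (1.0 - p_spam))


def conformity(y: int, f_x: float) -> float:
    """Similarity between label y and prediction f(x): f(x) for spam (1),
    -f(x) for email (0)."""
    return f_x if y == 1 else -f_x


f_x = log_odds(0.9)           # an object that looks like spam
score_spam = conformity(1, f_x)
score_email = conformity(0, f_x)
```

An object judged likely to be spam conforms well when paired with the label 1 (positive score) and poorly when paired with the label 0 (negative score).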

The R programs used in the experiments described in this section are
available from the web site http://alrw.net; the programs use the gbm package
with virtually all parameters set to the default values (given in the description
provided in response to help("gbm")).

The upper left plot in Figure 2 is the scatter plot of the pairs $(p^{\text{email}}, p^{\text{spam}})$
produced by the ICP for all examples in the test set. Email is shown as green
noughts and spam as red crosses (and it is noticeable that the noughts were
drawn after the crosses). The other two plots in the upper row are for email and
spam separately. Ideally, email should be close to the horizontal axis and spam
to the vertical axis; we can see that this is often true, with a few exceptions. The
picture for the label conditional ICP looks almost identical: see the lower row
of Figure 2. However, on the log scale the difference becomes more noticeable:
see Figure 3.

Table 1 gives some statistics for the numbers of errors, multiple, and empty
set predictions in the case of the (unconditional) ICP $\Gamma^{5\%}$ at significance level 5%
(we obtain different numbers not only because of different splits but also because
MART is randomized; the columns of the table correspond to the pseudorandom
number generator seeds 0, 1, 2, etc.). The table demonstrates the validity, (lack
of) conditional validity, and efficiency of the algorithm (the latter is of course
inherited from the efficiency of MART). We give two kinds of conditional figures:
the percentages of errors, multiple, and empty predictions for different labels
and for two different kinds of objects. The two kinds of objects are obtained
by splitting the object space X by the value of an attribute that we denote
$: it shows the percentage of the character $ in the text of the message. The
condition $ < 5.55% was the root of the decision tree chosen both by Hastie
et al. (2009, Section 9.2.5), who use all attributes in their analysis, and by
Maindonald and Braun (2007, Chapter 11), who use 6 attributes chosen by
them manually. (Both books use the rpart R package for decision trees.)

Figure 2: Scatter plots of the pairs (p^email, p^spam) for all examples in the test set
(left plots), for email only (middle), and for spam only (right). The three upper
plots are for the ICP and the three lower ones are for the label conditional ICP.
[Panel titles: "Email and spam", "Email only", "Spam only"; horizontal axis: email p-value.]

Notice that the numbers of errors, multiple predictions, and empty predic- 
tions tend to be greater for spam than for email. Somewhat counter-intuitively, 
they also tend to be greater for "email-like" objects containing few $ characters 
than for "spam-like" objects. The percentage of multiple and empty predictions 
is relatively small since the error rate of the underlying predictor happens to be 
close to our significance level of 5%. 
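The three statistics discussed here are straightforward to compute once the p-values of every test example are available. A minimal Python sketch, with a hypothetical function name `summarize` and made-up p-values (not the Spambase results):

```python
from typing import Dict, Hashable, Sequence


def summarize(
    p_values: Sequence[Dict[Hashable, float]],
    labels: Sequence[Hashable],
    epsilon: float,
) -> Dict[str, float]:
    """Fractions of errors, multiple predictions, and empty predictions."""
    errors = multiple = empty = 0
    for p, y in zip(p_values, labels):
        prediction = {lab for lab, p_lab in p.items() if p_lab > epsilon}
        errors += y not in prediction   # an empty prediction is always an error
        multiple += len(prediction) > 1
        empty += len(prediction) == 0
    n = len(labels)
    return {"errors": errors / n, "multiple": multiple / n, "empty": empty / n}


# Made-up p-values for four test examples, one dict per example.
p_values = [
    {"email": 0.80, "spam": 0.02},  # predicts {email}
    {"email": 0.30, "spam": 0.40},  # predicts {email, spam}
    {"email": 0.01, "spam": 0.03},  # predicts the empty set
    {"email": 0.04, "spam": 0.60},  # predicts {spam}
]
labels = ["email", "spam", "spam", "email"]
stats = summarize(p_values, labels, epsilon=0.05)
```

Restricting `p_values` and `labels` to the examples of one label, or of one kind of object, gives the corresponding conditional figures.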

In practice, using a fixed significance level (such as the standard 5%) is not 
a good idea; we should at least pay attention to what happens at several signifi- 
cance levels. However, experimenting with prediction sets at a fixed significance 
level facilitates a comparison with theoretical results. 

Table 2 gives similar statistics in the case of the label conditional ICP. The
error rates are now about equal for email and spam, as expected. We refrain from
giving similar predictable results for "object conditional" ICP with $ < 5.55%
and $ > 5.55% as categories.

Figure 3: The analogue of Figure 2 on the log scale.
[Horizontal axis: email p-value, log scale from 0.001 to 1.000.]

Figure 4 gives the calibration plots of the ICP for the test set. It shows
approximate validity even for email and spam separately, except for the
all-important lower-left corners. The latter are shown separately in Figure 5, where
the lack of conditional validity becomes evident; cf. Figure 6 for the label
conditional ICP.

From the numbers given in the "errors overall" row of Table 1 we can extract
the corresponding confidence intervals for the probability of error conditional on
the training set and MART's internal coin tosses; these are shown in Figure 7.
It can be seen that training conditional validity is not grossly violated. (Notice
that the 8 training sets used for producing this figure are not completely
independent. Besides, the assumption of randomness might not be completely
satisfied: permuting the data set ensures exchangeability but not necessarily
randomness.) It is instructive to compare Figure 7 with the "theoretical"
Figure 8 obtained from Propositions 2b (the thick blue line) and 2a (the thin
red line). The dotted green line corresponds to the significance level 5%, and
the black dot roughly corresponds to the maximal expected probability of error
among 8 randomly chosen training sets. (It might appear that there is a
discrepancy between Figures 7 and 8, but choosing different seeds usually leads to
smaller numbers of errors than in Figure 7.)
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RNG seed 





1 


2 


3 


4 


5 


6 


7 


Average 


errors overall 


4.1% 


6.9% 


4.6% 


5.4% 


5.3% 


6.1% 


7.7% 


5.9% 


5.75% 


for email 


2.44% 


4.61% 


2.26% 


3.10% 


4.49% 


3.98% 


5.02% 


3.22% 


3.64% 


for spam 


6.77% 


10.43% 


8.42% 


9.02% 


6.53% 


9.32% 


11.69% 


10.29% 


9.06% 


for $ < 5.55% 


4.36% 


7.91% 


5.15% 


6.21% 


6.27% 


7.89% 


8.79% 


7.04% 


6.70% 


for $ > 5.55% 


3.29% 


4.12% 


2.69% 


2.64% 


2.40% 


1.13% 


4.42% 


2.15% 


2.86% 


multiple overall 


2.7% 


0% 


0.1% 


0% 


0% 


0.5% 


0% 


0% 


0.41% 


for email 


2.11% 


0% 


0.16% 


0% 


0% 


0.33% 


0% 


0% 


0.33% 


for spam 


3.65% 


0% 


0% 


0% 


0% 


0.76% 


0% 


0% 


0.55% 


for $ < 5.55% 


3.04% 


0% 


0.13% 


0% 


0% 


0.68% 


0% 


0% 


0.48% 


for $ > 5.55% 


1.65% 


0% 


0% 


0% 


0% 


0% 


0% 


0% 


0.21% 


empty overall 


0% 


2.7% 


0% 


1.2% 


0.8% 


0% 


2.5% 


0.4% 


0.95% 


for email 


0% 


1.48% 


0% 


0.65% 


0.83% 


0% 


1.51% 


0.64% 


0.64% 


for spam 


0% 


4.58% 


0% 


2.06% 


0.75% 


0% 


3.98% 


0% 


1.42% 


for $ < 5.55% 


0% 


3.14% 


0% 


1.55% 


0.80% 


0% 


3.06% 


0.52% 


1.13% 


for $ > 5.55% 


0% 


1.50% 


0% 


0% 


0.80% 


0% 


0.80% 


0% 


0.39% 



Table 1: Percentage of errors, multiple predictions, and empty predictions on 
the full test set and separately on email and spam. The results are given for 
various values of the seed for the R (pseudo)random number generator (RNG); 
column "Average" gives the average values for all 8 seeds 0-7. 
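
As an illustration of how confidence intervals can be extracted from the "errors overall" row of Table 1, here is a minimal Python sketch. It rests on two assumptions that are not stated in this part of the paper: that the intervals are exact (Clopper-Pearson) binomial intervals, and that the test set contains 1000 examples (the hypothetical constant `N_TEST`). It is not necessarily the procedure used to produce Figure 7.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k), by direct summation of the binomial probabilities."""
    if k < 0:
        return 0.0
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(min(k, n) + 1))


def _solve_increasing(h, target, tol=1e-10):
    """Bisection on [0, 1] for h(p) = target, where h is non-decreasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if h(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(errors, n_test, conf=0.95):
    """Exact binomial interval for the error probability, given the number
    of errors observed on n_test test examples."""
    a = (1 - conf) / 2
    # Lower end: the p at which P(Bin(n_test, p) >= errors) equals a.
    lower = 0.0 if errors == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(errors - 1, n_test, p), a)
    # Upper end: the p at which P(Bin(n_test, p) <= errors) equals a.
    upper = 1.0 if errors == n_test else _solve_increasing(
        lambda p: 1 - binom_cdf(errors, n_test, p), 1 - a)
    return lower, upper


# Assumption: a test set of 1000 examples (hypothetical value).
N_TEST = 1000
# 4.1% is the seed-0 entry of the "errors overall" row of Table 1.
errors = round(0.041 * N_TEST)
print(clopper_pearson(errors, N_TEST, conf=0.95))
print(clopper_pearson(errors, N_TEST, conf=0.80))
```

The 80% interval is nested inside the 95% one, which matches the thick and thin lines described in the caption of Figure 7.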



7 ICPs and ROC curves 



This section will discuss a close connection between an important class of ICPs
("probability-type" label conditional ICPs) and ROC curves. (For a previous
study of the connection between conformal prediction and ROC curves, see
Vanderlooy and Sprinkhuizen-Kuyper 2007.) Let us say that an ICP or a label
conditional ICP is probability-type if its inductive conformity measure is defined
by (1) where $f$ takes values in $\mathbb{R}$ and $\Delta$ is defined by (16).



RNG seed                 0       1       2       3       4       5       6       7   Average

errors overall        3.4%    6.0%    3.8%    4.8%    5.7%    5.3%    6.5%    5.4%     5.11%
  for email          3.73%   6.92%   3.87%   4.90%   6.64%   4.98%   5.85%   3.86%     5.10%
  for spam           2.86%   4.58%   3.68%   4.64%   4.27%   5.79%   7.46%   7.92%     5.15%

multiple overall      4.2%      0%    4.0%      0%      0%    0.5%      0%    0.5%     1.15%
  for email          3.90%      0%   5.48%      0%      0%   0.66%      0%   0.48%     1.32%
  for spam           4.69%      0%   1.58%      0%      0%   0.25%      0%   0.53%     0.88%

empty overall           0%    1.0%      0%      0%    0.6%      0%    1.0%      0%     0.33%
  for email             0%   1.48%      0%      0%   0.83%      0%   0.67%      0%     0.37%
  for spam              0%   0.25%      0%      0%   0.25%      0%   1.49%      0%     0.25%

Table 2: The analogue of a subset of Table 1 in the case of the label conditional
ICP.

Figure 5: The lower left corners of the plots in Figure 4. (Panels: overall
calibration plot, calibration plot for email, calibration plot for spam;
horizontal axis: significance level, from 0.00 to 0.20.)

The reader might have noticed that the two leftmost plots in Figure 2 look
similar to a ROC curve. The following proposition will show that this is not
coincidental in the case of the lower left one. However, before we state it, we
need a few definitions. We will now consider a general binary classification
problem and will denote the labels as 0 and 1. For a threshold $c \in \mathbb{R}$, the type
I error on the calibration set is

$$\alpha(c) := \frac{|\{i = m+1, \ldots, l \mid f(x_i) \ge c \;\&\; y_i = 0\}|}{|\{i = m+1, \ldots, l \mid y_i = 0\}|} \tag{17}$$

and the type II error on the calibration set is

$$\beta(c) := \frac{|\{i = m+1, \ldots, l \mid f(x_i) \le c \;\&\; y_i = 1\}|}{|\{i = m+1, \ldots, l \mid y_i = 1\}|} \tag{18}$$

(with 0/0 set, e.g., to 1/2). Intuitively, these are the error rates for the classifier
that predicts 1 when $f(x) > c$ and predicts 0 when $f(x) < c$; our definition is
conservative in that it counts the prediction as an error whenever $f(x) = c$. The
ROC curve is the parametric curve

$$\{(\alpha(c), \beta(c)) \mid c \in \mathbb{R}\} \subseteq [0, 1]^2. \tag{19}$$

(Our version of ROC curves is the original version reflected in the line $y = 1/2$;
in our sloppy terminology we follow Hastie et al. (2009), whose version is the
original one reflected in the line $x = 1/2$, and many other books and papers;
see, e.g., Bengio et al. 2005, Figure 1.)

Figure 7: Confidence intervals for training conditional error probabilities: 95%
in black (thin lines) and 80% in blue (thick lines). The 5% significance level is
shown as the horizontal red line. (Horizontal axis: seed.)
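
The empirical error rates (17) and (18) and the ROC curve (19) can be sketched in a few lines of Python. The function names and the toy calibration set below are illustrative assumptions, not taken from the paper; `scores[i]` stands for $f(x_i)$ and `labels[i]` for $y_i$ on the calibration set.

```python
def type_errors(scores, labels, c):
    """Empirical type I and type II errors (17)-(18) at threshold c."""
    n0 = sum(1 for y in labels if y == 0)
    n1 = sum(1 for y in labels if y == 1)
    # Ties f(x_i) == c count as errors for both labels (the conservative choice).
    type1 = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= c)
    type2 = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= c)
    alpha = type1 / n0 if n0 else 0.5  # 0/0 is set to 1/2
    beta = type2 / n1 if n1 else 0.5
    return alpha, beta


def roc_points(scores, labels):
    """All distinct points (alpha(c), beta(c)) of the curve (19).

    The pair only changes at the observed scores, so it is enough to try
    thresholds below, at, between, and above them.
    """
    s = sorted(set(scores))
    between = [(a + b) / 2 for a, b in zip(s, s[1:])]
    thresholds = [s[0] - 1.0] + s + between + [s[-1] + 1.0]
    return sorted({type_errors(scores, labels, c) for c in thresholds})


# Toy calibration set (hypothetical numbers).
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(type_errors(scores, labels, 0.5))
print(roc_points(scores, labels))
```

With these conventions the curve runs from (1, 0) for very small thresholds to (0, 1) for very large ones, which is the reflected orientation described above.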



Proposition 5. In the case of a probability-type label conditional ICP, for any
object $x \in X$, the distance between the pair $(p^0, p^1)$ (see (10)) and the ROC
curve is at most

$$\sqrt{\frac{1}{(n^0 + 1)^2} + \frac{1}{(n^1 + 1)^2}}, \tag{20}$$

where $n^y$ is the number of examples in the calibration set labelled as $y$.

Proof. Let $c := f(x)$. Then we have

$$(p^0, p^1) = \left( \frac{n^0_{\ge} + 1}{n^0 + 1}, \frac{n^1_{\le} + 1}{n^1 + 1} \right), \tag{21}$$

where $n^0_{\ge}$ is the number of examples $(x_i, y_i)$ in the calibration set such that $y_i = 0$
and $f(x_i) \ge c$ and $n^1_{\le}$ is the number of examples in the calibration set such
that $y_i = 1$ and $f(x_i) \le c$. It remains to notice that the point $(n^0_{\ge}/n^0, n^1_{\le}/n^1)$
belongs to the ROC curve: the horizontal (resp. vertical) distance between this
point and (21) does not exceed $1/(n^0 + 1)$ (resp. $1/(n^1 + 1)$), and the overall
Euclidean distance does not exceed (20). □

Figure 8: The probability of error $E$ vs $\delta$ from Propositions 2b (the thick blue
line) and 2a (the thin red line), where $\epsilon = 0.05$ and $n = 999$.

Figure 9: The lower left corner of the lower left plot of Figure 2 with the
empirical (solid blue), minimax (dashed blue), and Laplace (dotted blue) ROC
curves.
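
The argument in the proof of Proposition 5 can be checked numerically. The following Python sketch uses hypothetical helper names and toy data, and assumes that both labels occur in the calibration set. It computes the pair of p-values (21), the ROC point at $c = f(x)$, and the bound (20).

```python
from math import hypot, sqrt


def p_values_and_roc_point(scores, labels, fx):
    """Return the pair (21) and the ROC point at threshold c = f(x).

    scores[i] stands for f(x_i), labels[i] in {0, 1} for y_i (calibration set).
    """
    n0 = labels.count(0)
    n1 = labels.count(1)
    n0_ge = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= fx)
    n1_le = sum(1 for s, y in zip(scores, labels) if y == 1 and s <= fx)
    p = ((n0_ge + 1) / (n0 + 1), (n1_le + 1) / (n1 + 1))  # (21)
    roc = (n0_ge / n0, n1_le / n1)  # point of the curve (19) at c = f(x)
    return p, roc


def bound(n0, n1):
    """The right-hand side (20)."""
    return sqrt(1 / (n0 + 1) ** 2 + 1 / (n1 + 1) ** 2)


# Toy calibration set (hypothetical numbers).
scores = [0.05, 0.20, 0.30, 0.45, 0.40, 0.60, 0.75, 0.90]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
p, roc = p_values_and_roc_point(scores, labels, fx=0.42)
distance = hypot(p[0] - roc[0], p[1] - roc[1])
print(p, roc, distance, bound(4, 4))
```

Since the ROC point at $c = f(x)$ is only one point of the curve, the distance from $(p^0, p^1)$ to the curve itself can only be smaller than the distance computed here.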



So far we have discussed the empirical ROC curve: (17) and (18) are the
empirical probabilities of errors of the two types on the calibration set. It
corresponds to the estimate $k/n$ of the parameter of the binomial distribution
based on observing $k$ successes out of $n$. The minimax estimate is $(k + 1/2)/(n + 1)$,
and the corresponding ROC curve (19), where $\alpha(c)$ and $\beta(c)$ are defined
by (17) and (18) with the numerators increased by $1/2$ and the denominators
increased by 1, will be called the minimax ROC curve. Notice that for the
minimax ROC curve we can put a coefficient of $1/2$ in front of (20). Similarly,
when using the Laplace estimate $(k + 1)/(n + 2)$, we obtain the Laplace ROC
curve. See Figure 9 for the lower left corner of the lower left plot of Figure 2
with different ROC curves added to it.
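
For concreteness, the three estimates can be compared on a small example. The Python sketch below uses illustrative function names and hypothetical counts; it also shows why the coefficient in front of (20) drops to $1/2$ for the minimax ROC curve: each coordinate of (21) has the form $(k + 1)/(n + 1)$, which differs from the minimax estimate by exactly $(1/2)/(n + 1)$.

```python
def empirical(k, n):
    """Empirical estimate k/n of a binomial parameter."""
    return k / n


def minimax(k, n):
    """The estimate (k + 1/2)/(n + 1), called minimax in the text."""
    return (k + 0.5) / (n + 1)


def laplace(k, n):
    """Laplace estimate (k + 1)/(n + 2)."""
    return (k + 1) / (n + 2)


# Hypothetical counts: k = 3 calibration examples out of n = 99.
k, n = 3, 99
p_coordinate = (k + 1) / (n + 1)  # a coordinate of the pair of p-values (21)
print(empirical(k, n), minimax(k, n), laplace(k, n))
print(p_coordinate - minimax(k, n), 0.5 / (n + 1))
```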
In conclusion of our study of the Spambase data set, we will discuss the
asymmetry of the two kinds of error in spam detection: classifying email as spam
is much more harmful than letting occasional spam in. A reasonable approach
is to start from a small number $\epsilon > 0$, the maximum tolerable percentage of
email classified as spam, and then to try to minimize the percentage of spam
classified as email under this constraint. The standard way of doing this is to
classify a message $x$ as spam if and only if $f(x) > c$, where $c$ is the point on
the ROC curve corresponding to the type I error $\epsilon$. It is not clear what this
means precisely, since we only have access to an estimate of the true ROC curve
(and even on the true ROC curve such a point might not exist). But roughly,
this means classifying $x$ as spam if $f(x)$ exceeds the $k$th largest value in the set
$\{\alpha_i \mid i \in \{m+1, \ldots, l\} \;\&\; y_i = \text{email}\}$, where $k$ is close to $\epsilon n^0$ and $n^0$ is the
size of this set (i.e., the number of email in the calibration, or validation, set).
To make this more precise, we can use the "one-sided label conditional ICP"
classifying $x$ as spam if and only if $p^0 \le \epsilon$ for $x$ (see footnote 1). According to (21), this means
that we classify $x$ as spam if and only if $f(x)$ exceeds the $k$th largest value in
the set $\{\alpha_i \mid i \in \{m+1, \ldots, l\} \;\&\; y_i = \text{email}\}$, where $k := \lfloor \epsilon(n^0 + 1) \rfloor$. The
advantage of this version of the standard method is that it guarantees that the
probability of mistaking email for spam is at most $\epsilon$ (see Proposition 3) and also
enjoys the training conditional version of this property given by Proposition 2a
(more accurately, its version for label conditional ICPs).
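
The equivalence between the rule $p^0 \le \epsilon$ and the threshold rule can be sketched as follows. The function names and the toy scores are illustrative assumptions; `email_scores` stands for the values $f(x_i)$ of the calibration examples labelled as email, and `p0` follows the first coordinate of (21).

```python
def p0(email_scores, fx):
    """First coordinate of (21): p-value for the hypothesis 'x is email'."""
    n_ge = sum(1 for s in email_scores if s >= fx)
    return (n_ge + 1) / (len(email_scores) + 1)


def is_spam(email_scores, fx, eps):
    """One-sided label conditional ICP: classify as spam iff p^0 <= eps."""
    return p0(email_scores, fx) <= eps


def spam_threshold(email_scores, eps):
    """The k-th largest email score, or None if nothing can be called spam.

    In exact arithmetic k equals floor(eps * (n0 + 1)); it is found by
    comparison here so that it agrees with the floating-point test in is_spam.
    """
    n0 = len(email_scores)
    k = sum(1 for j in range(1, n0 + 2) if j / (n0 + 1) <= eps)
    if k == 0:
        return None
    return sorted(email_scores, reverse=True)[k - 1]


# Toy calibration emails (hypothetical): scores 0.00, 0.01, ..., 0.98.
email_scores = [i / 100 for i in range(99)]
eps = 0.05
threshold = spam_threshold(email_scores, eps)
print(threshold)
print(is_spam(email_scores, 0.95, eps), is_spam(email_scores, 0.94, eps))
```

A new message is then classified as spam exactly when its score strictly exceeds the returned threshold; with very small `eps` or few calibration emails no threshold exists and nothing is classified as spam.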



8 Conclusion 

The goal of this paper has been to explore various versions of the requirement
of conditional validity. With a small training set, we have to content ourselves
with unconditional validity (or abandon any formal requirement of validity
altogether). For bigger training sets, training conditional validity will be approached
by ICPs automatically, and we can approach example conditional validity by
using conditional ICPs but making sure that the size of a typical category does
not become too small (say, less than 100). In problems of binary classification,
we can control false positive and false negative rates by using label conditional
ICPs.

The known property of validity of inductive conformal predictors
(Proposition 1) can be stated in the traditional statistical language (see, e.g., Fraser
1957 and Guttman 1970) by saying that they are $1 - \epsilon$ expectation tolerance
regions, where $\epsilon$ is the significance level. In classical statistics, however, there
are two kinds of tolerance regions: $1 - \epsilon$ expectation tolerance regions and
PAC-type $1 - \delta$ tolerance regions for a proportion $1 - \epsilon$, in the terminology of Fraser
(1957). We have seen (Proposition 2a) that inductive conformal predictors are
tolerance regions in the second sense as well (cf. Appendix A).



1 In practice, we might want to improve the predictor by adding another step and changing
the classification from spam to email if $p^1$ is also small, in which case $x$ looks neither like spam
nor email. In view of Proposition 5, however, this step can be disregarded for probability-type
ICP unless $\epsilon$ is very lax.

A disadvantage of inductive conformal predictors is their potential predictive
inefficiency: indeed, the calibration set is wasted as far as the development of the
prediction rule $f$ is concerned, and the proper training set is wasted as far
as the calibration of conformity scores into p-values is concerned. Conformal
predictors use the full training set for both purposes, and so can be expected
to be significantly more efficient. (There have been reports of comparable and 
even better predictive efficiency of ICPs as compared to conformal predictors 
but they may be unusual artefacts of the methods used and particular data 
sets.) It is an open question whether we can guarantee training conditional 
validity under ^ or a similar condition for conformal predictors different from 
classical tolerance regions. Perhaps no universal results of this kind exist, and 
different families of conformal predictors will require different methods. 
A Training conditional validity for classical tolerance regions



In this appendix we compare Propositions 2a and 2b with the results (see, e.g.,
Fraser 1957 and Guttman 1970) about classical tolerance regions (which are a
special case of conformal predictors, as explained in Vovk et al. 2005, p. 257).

It is well known that under appropriate continuity assumptions the classical
tolerance regions that discard $\epsilon(n + 1)$ out of the $n + 1$ statistically equivalent
blocks (in this appendix we always assume that $\epsilon(n + 1)$ is an integer number)
have coverage probability following the beta distribution with parameters
$(1 - \epsilon)(n + 1)$ and $\epsilon(n + 1)$ (see, e.g., Tukey 1947 or Guttman 1970, Theorems 2.2 and
2.3); in particular, their expected coverage probability is $1 - \epsilon$. This immediately
implies the following corollary: if $T$ is a classical tolerance predictor with sample
size $n$ and expected coverage probability $1 - \epsilon$, it is $(E, \delta)$-valid if and only if

$$\delta \ge \mathrm{Bet}_{(1-\epsilon)(n+1),\,\epsilon(n+1)}(1 - E) = 1 - \mathrm{Bet}_{\epsilon(n+1),\,(1-\epsilon)(n+1)}(E), \tag{22}$$

where $\mathrm{Bet}_{\alpha,\beta}$ is the cumulative beta distribution function with parameters $\alpha$
and $\beta$.



The following lemma shows that in fact (22) coincides with the condition
of Proposition 2b for ICPs (under our assumption $\epsilon(n + 1) \in \mathbb{Z}$). Of course, $n$ means different
things in that condition and in (22): the size of the calibration set in the former and the size
of the full training set in the latter.



Lemma 3 (http://dlmf.nist.gov/8.17.E5). For all $n \in \{1, 2, \ldots\}$, all $k \in
\{0, 1, \ldots, n\}$, and all $E \in [0, 1]$,

$$\mathrm{bin}_{n,E}(k - 1) = \mathrm{Bet}_{n+1-k,\,k}(1 - E) = 1 - \mathrm{Bet}_{k,\,n+1-k}(E). \tag{23}$$



Proof. The equality between the last two terms of (23) is obvious. The last
term of (23) is the probability that the $k$th smallest value in a sample of size $n$
from the uniform probability distribution $U$ on $[0, 1]$ exceeds $E$. This event is
equivalent to at most $k - 1$ of $n$ independent random variables generated from
$U$ belonging to the interval $[0, E]$, and so the probability of this event is given
by the first term of (23). □
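
The probabilistic argument of the proof can be illustrated by simulation. The Python sketch below uses illustrative names and parameter values. It estimates the probability that the $k$th smallest of $n$ uniform random numbers exceeds $E$ and compares it with the binomial term of (23), assuming that $\mathrm{bin}_{n,E}$ denotes the cumulative binomial distribution function. It does not evaluate the beta distribution function itself.

```python
import random
from math import comb


def binom_cdf(k, n, p):
    """P(Bin(n, p) <= k), by direct summation of the binomial probabilities."""
    if k < 0:
        return 0.0
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(min(k, n) + 1))


def prob_kth_smallest_exceeds(n, k, E, trials, rng):
    """Monte Carlo estimate of P(k-th smallest of n uniforms on [0, 1] > E)."""
    hits = 0
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        if sample[k - 1] > E:
            hits += 1
    return hits / trials


# Illustrative parameter values (hypothetical).
n, k, E = 20, 3, 0.1
rng = random.Random(0)  # fixed seed so that the run is reproducible
simulated = prob_kth_smallest_exceeds(n, k, E, trials=20000, rng=rng)
exact = binom_cdf(k - 1, n, E)  # first term of (23)
print(simulated, exact)
```

The two printed numbers should agree up to Monte Carlo error.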



The assumption of continuity was removed by Tukey (1948) and Fraser and
Wormleighton (1951). We will state this result only for the simplest kind of
classical tolerance regions, essentially those introduced by Wilks (1941) (this
special case was obtained already by Scheffé and Tukey 1945, p. 192). Suppose
the object space $X$ is a one-element set and the label space is $Y = \mathbb{R}$ (therefore,
we consider the problem of predicting real numbers without objects). For two
numbers $L < U$ in the set $\{0, 1, \ldots, n + 1\}$ consider the set predictor $[y_{(L)}, y_{(U)}]$,
where $y_{(i)}$ is the $i$th order statistic (the $i$th smallest value in the training set
$(y_1, \ldots, y_n)$, except that $y_{(0)} := -\infty$ and $y_{(n+1)} := \infty$). This set predictor is
$(E, \delta)$-valid provided we have (22) with $\epsilon(n + 1)$ replaced by $L + n + 1 - U$.
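
A minimal Python sketch of this set predictor may be helpful. The function names and the toy training set are illustrative assumptions.

```python
import math


def wilks_interval(ys, L, U):
    """Return (y_(L), y_(U)), with y_(0) = -inf and y_(n+1) = +inf."""
    n = len(ys)
    if not (0 <= L < U <= n + 1):
        raise ValueError("need 0 <= L < U <= n + 1")
    ordered = [-math.inf] + sorted(ys) + [math.inf]
    return ordered[L], ordered[U]


def discarded_blocks(n, L, U):
    """Number of the n + 1 blocks left outside [y_(L), y_(U)]:
    L of them below y_(L) and n + 1 - U above y_(U)."""
    return L + n + 1 - U


# Toy training set of real-valued labels (hypothetical numbers).
ys = [3.1, 0.5, 2.2, 1.7]
print(wilks_interval(ys, 1, 4), discarded_blocks(len(ys), 1, 4))
print(wilks_interval(ys, 0, 5), discarded_blocks(len(ys), 0, 5))
```

With $L = 0$ and $U = n + 1$ no block is discarded and the prediction is the whole real line.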



It is easy to see that Proposition 2b (and, therefore, Proposition 2a) can
in fact be deduced from Scheffé and Tukey's result. This follows from the
interpretation of inductive conformal predictors as a "conditional" version of
Wilks's predictors corresponding to $L := \epsilon(n + 1)$ and $U := n + 1$. After observing
the proper training set we apply Wilks's predictors to the conformity scores $\alpha_i$
of the calibration examples to predict the conformity score of a test example;
the set prediction of the conformity score for the test object is transformed into
the prediction set consisting of the labels leading to a score in the predicted
range.
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